Abstract

We are concerned with obtaining novel concentration inequalities for the
missing mass, i.e. the total probability mass of the outcomes not observed in
the sample. We not only derive - for the first time - distribution-free
Bernstein-like deviation bounds with sublinear exponents in deviation size for
the missing mass, but also improve the results of McAllester and Ortiz (2003)
and Berend and Kontorovich (2013, 2012) for small deviations, which is the most
interesting case in learning theory. It is known that the majority of standard
inequalities cannot be directly used to analyze heterogeneous sums, i.e. sums
whose terms differ greatly in magnitude. Our generic and intuitive approach
shows that the heterogeneity issue introduced in McAllester and Ortiz (2003) is
resolvable, at least in the case of the missing mass, by regulating the terms
using our novel thresholding technique.


1 INTRODUCTION 

Missing mass is the total probability associated with the outcomes that have
not been seen in the sample, and it is one of the important quantities in
machine learning and statistics. It connects density estimates obtained from a
given sample to the population for discrete distributions: the smaller the
missing mass, the more useful the information that can be extracted from the
dataset. Roughly speaking, the larger the missing mass, the less we can
discover about the true unknown underlying distribution, and hence the less we
can statistically generalize to the whole population. In other words, missing
mass measures how representative a given dataset is, assuming that it has been
sampled according to the true distribution.


Often, one is interested in understanding the behaviour of the missing mass as
a random variable. One of the important approaches in such studies involves
bounding the fluctuations of the random variable around a certain quantity,
namely its mean. Concentration inequalities are powerful tools for performing
analysis of this type. Let $X$ be any non-negative real-valued random variable
with finite mean. The goal is to establish, for any $\epsilon > 0$, probability
bounds of the form

$$
\begin{aligned}
P(X - E[X] \le -\epsilon) &\le \exp(-\eta_l(\epsilon)), \\
P(X - E[X] \ge \epsilon) &\le \exp(-\eta_u(\epsilon)),
\end{aligned} \tag{1}
$$

where $\eta_l(\epsilon)$ and $\eta_u(\epsilon)$ are some non-decreasing
functions of $\epsilon$ and where it is desirable to find the largest such
functions for the variable $X$ and for the 'target' interval of $\epsilon$.
These bounds are commonly called lower and upper deviation bounds respectively.
In most practical scenarios, we are in a non-asymptotic setting where we have
access to a sample $X_1, \dots, X_n$ and we would like to derive concentration
inequalities that explicitly describe the dependence on the sample size $n$.
Namely, we would like to obtain bounds of the form

$$
\begin{aligned}
P(X - E[X] \le -\epsilon) &\le \exp(-\eta_l(\epsilon, n)), \\
P(X - E[X] \ge \epsilon) &\le \exp(-\eta_u(\epsilon, n)),
\end{aligned} \tag{2}
$$

where $\eta_l(\epsilon, n)$ and $\eta_u(\epsilon, n)$ are both non-decreasing
functions of $\epsilon$ and $n$. Many such bounds are distribution-free, i.e.
they hold irrespective of the underlying distribution.

McAllester and Schapire (2000) established concentration inequalities for the
missing mass for the first time. A follow-up work by McAllester and Ortiz
(2003) pointed out the inadequacy of standard inequalities, developed a
thermodynamical viewpoint for addressing this issue and sharpened these bounds.
Berend and Kontorovich (2013) further refined the bounds via arguments similar
to the Kearns-Saul inequality (Kearns and Saul (1998)) and the logarithmic
Sobolev inequality (Boucheron et al. (2013)). These previous works, however,
not only involve overly specific approaches to concentration and to handling
the heterogeneity issue but also do not yield sharp bounds for small
deviations, which is the most interesting case in learning theory.

In this paper, we shall derive distribution-free concentration inequalities for
the missing mass in a novel way. The primary objective of our approach is to
introduce a notion of heterogeneity control which allows us to regulate the
magnitude of bins in the histogram of the discrete distribution being analyzed.
This mechanism in turn enables us to control the behaviour of central
quantities such as the variance or martingale differences of the random
variable in question. These are the main quantities that appear in standard
concentration inequalities such as Bernstein, Bennett and McDiarmid, just to
name a few. Consequently, instead of discovering a new method for bounding the
fluctuations of each random variable of interest, we will be able to directly
apply standard inequalities to obtain probabilistic bounds on many discrete
random variables including the missing mass.


The rest of the paper is structured as follows. Section 2 contains the
background information and introduces the notation. Section 3 outlines
motivations and the main contributions. In Section 4, we explain negative
dependence and information monotonicity and develop a few fundamental tools,
whereas Section 5 presents the proofs of our upper and lower deviation bounds
based on these tools. Finally, Section 6 concludes the paper and compares our
bounds with existing results for small deviations.

2 PRELIMINARIES

In this section, we will provide definitions, notation and other background
material.

Consider $P : \mathcal{I} \to [0,1]$ to be a fixed but unknown discrete
distribution on some finite or countable non-empty set $\mathcal{I}$ with
$|\mathcal{I}| = N$. Let $\{w_i : i \in \mathcal{I}\}$ be the probability (or
frequency) of drawing the $i$-th outcome. Moreover, suppose that we observe an
i.i.d. sample $\{X_j\}_{j=1}^{n}$ from this distribution with $n$ being the
sample size. Now, missing mass is defined as the total probability mass
corresponding to the outcomes that are not present in our sample. Namely,
missing mass is a random variable that can be expressed as:

$$Y := \sum_{i \in \mathcal{I}} w_i Y_i, \tag{3}$$

where we define each $\{Y_i : i \in \mathcal{I}\}$ to be a Bernoulli variable
that takes on 0 if the $i$-th outcome exists in the sample and 1 otherwise.
Namely, we have

$$Y_i = \mathbb{1}\big[(X_1 \ne i) \wedge (X_2 \ne i) \wedge \dots \wedge (X_n \ne i)\big]. \tag{4}$$

We assume that for all $i \in \mathcal{I}$, $w_i > 0$ and
$\sum_{i \in \mathcal{I}} w_i = 1$. Denote $P(Y_i = 1) = q_i$ and
$P(Y_i = 0) = 1 - q_i$ and let us suppose that the $Y_i$s are independent: as
we will see later in this section, such an assumption will not impose a burden
on our proof structure and flow. Hence, we will have that
$q_i = E[Y_i] = (1 - w_i)^n \le e^{-n w_i}$ where $q_i \in (0,1)$. Namely,
defining $f : (1, n) \to (e^{-n}, \frac{1}{e}) \subset (0,1)$ where
$f(\theta) = e^{-\theta}$ with $\theta \in D_f$ and taking
$w_i > \frac{\theta}{n}$ amounts to $q_i < f(\theta)$. This provides a basis for
the 'thresholding' technique that we will employ in our proof.

Choosing the representation (3) for the missing mass, one has

$$E[Y]_{\mathcal{I}} := \sum_{i \in \mathcal{I}} w_i q_i = \sum_{i \in \mathcal{I}} w_i (1 - w_i)^n, \tag{5}$$

$$V_{\mathcal{I}} := \sum_{i \in \mathcal{I}} w_i^2 \, \mathrm{VAR}[Y_i], \tag{6}$$

$$\sigma^2_{\mathcal{I}} := \sum_{i \in \mathcal{I}} w_i \, \mathrm{VAR}[Y_i], \tag{7}$$

where we have introduced the weighted variance notation $\sigma^2$ and where
each quantity is attached to a set over which it is defined. Note that
$\mathrm{VAR}[Y_i]$ is the individual variance corresponding to $Y_i$, which is
defined as

$$\mathrm{VAR}[Y_i] = q_i (1 - q_i) = (1 - w_i)^n \big(1 - (1 - w_i)^n\big). \tag{8}$$

One can define the above quantities not just over the set $\mathcal{I}$ but on
some (proper) subset of it that may depend on or be described by some
variable(s) of interest. For instance, in our proofs the variable $\theta$ may
be responsible for choosing $\mathcal{I}_\theta \subseteq \mathcal{I}$ over
which the above quantities will be evaluated. For lower deviation and upper
deviation, we find it convenient to refer to the associated set by
$\mathcal{L}$ and $\mathcal{U}$ respectively. Likewise, we will use subscripts
$l$ and $u$ to refer to objects that characterize lower deviation and upper
deviation respectively. Also, we use the notation
$Y_{i:j} = Y_i, \dots, Y_j$ to refer to the sequence of variables whose index
starts at the $i$-th variable and ends at the $j$-th variable. Finally, other
notation or definitions may be introduced within the body of the proof when
required.

We will encounter the Lambert $W$-function - also known as the product
logarithm function - in this paper, which describes the inverse relation of
$f(x) = x e^{x}$ and which cannot be expressed in terms of elementary
functions. This function is double-valued when $x \in \mathbb{R}$. However, it
becomes invertible in a restricted domain. The lower branch of it is denoted by
$W_{-1}(\cdot)$, which is the only branch that will prove beneficial in this
paper. The reader is advised to refer to Corless et al. (1996) for a detailed
treatment.

Throughout the paper, we shall use the convention that capital letters refer to
random variables whereas lower case letters correspond to realizations thereof.
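
As a small numerical illustration of the definition of the missing mass in (3) and (4), the following sketch compares a Monte Carlo estimate of its mean with the closed form $\sum_i w_i (1 - w_i)^n$ implied by $q_i = (1 - w_i)^n$. The distribution `w`, the sample size and the helper names are hypothetical choices made only for this example; they do not come from the paper.

```python
import random

def expected_missing_mass(w, n):
    """E[Y] = sum_i w_i (1 - w_i)^n, using q_i = (1 - w_i)^n."""
    return sum(wi * (1.0 - wi) ** n for wi in w)

def sample_missing_mass(w, n, rng):
    """Draw n i.i.d. outcomes from w and return the total mass of unseen outcomes."""
    seen = set(rng.choices(range(len(w)), weights=w, k=n))
    return sum(wi for i, wi in enumerate(w) if i not in seen)

def simulate(w, n, trials=5000, seed=0):
    """Monte Carlo estimate of the mean missing mass."""
    rng = random.Random(seed)
    return sum(sample_missing_mass(w, n, rng) for _ in range(trials)) / trials

# Hypothetical heterogeneous distribution: a few heavy bins plus many light ones.
w = [0.3, 0.2, 0.1] + [0.4 / 40] * 40
n = 50
exact = expected_missing_mass(w, n)
estimate = simulate(w, n)
```

The heavy bins are almost always observed, so nearly all of the expected missing mass comes from the light bins; this is the heterogeneity that the later sections have to control.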


We will utilize Bernstein's inequality in our derivation. Suitable
representations of this result are outlined below without proof.

Theorem. [Bernstein] Let $Z_1, \dots, Z_N$ be independent zero-mean random
variables such that one has $|Z_i| \le \alpha$ almost surely for all $i$. Then,
using Bernstein's inequality (Bernstein (1924)) one obtains for all
$\epsilon > 0$:

$$P\left(\sum_{i=1}^{N} Z_i > \epsilon\right) \le \exp\left(-\frac{\epsilon^2}{2\left(V + \frac{1}{3}\alpha\epsilon\right)}\right), \tag{9}$$

where $V = \sum_{i=1}^{N} E[Z_i^2]$.

Now, consider the sample mean $\bar{Z} = n^{-1} \sum_{i=1}^{n} Z_i$ and let
$\bar{\sigma}^2$ be the sample variance, namely
$\bar{\sigma}^2 := n^{-1} \sum_{i=1}^{n} \mathrm{VAR}[Z_i] = n^{-1} \sum_{i=1}^{n} E[Z_i^2]$.
So, using (9) with $n \cdot \epsilon$ in the role of $\epsilon$, we get

$$P(\bar{Z} > \epsilon) \le \exp\left(-\frac{n\epsilon^2}{2\left(\bar{\sigma}^2 + \frac{1}{3}\alpha\epsilon\right)}\right). \tag{10}$$

If $Z_1, \dots, Z_n$ are, moreover, not just independent but also identically
distributed, then $\bar{\sigma}^2$ is equal to $\sigma^2$, i.e. the variance of
each $Z_i$. The latter presentation makes explicit: (1) the exponential decay
with $n$; (2) the fact that for $\bar{\sigma}^2 \le \epsilon$ we get a tail
probability with exponent of order $n\epsilon$ rather than $n\epsilon^2$
(Lugosi (2003); Boucheron et al. (2013)), which has the potential to yield
stronger bounds for small $\epsilon$.
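
To make the difference between exponents of order $n\epsilon$ and $n\epsilon^2$ concrete, the sketch below evaluates the sample-mean form of Bernstein's bound next to Hoeffding's bound for variables confined to $[-\alpha, \alpha]$. The function names and the numerical values are illustrative assumptions, not part of the paper.

```python
import math

def bernstein_tail(eps, n, var, alpha):
    """Sample-mean Bernstein bound: exp(-n eps^2 / (2 (var + alpha * eps / 3)))."""
    return math.exp(-n * eps ** 2 / (2.0 * (var + alpha * eps / 3.0)))

def hoeffding_tail(eps, n, width):
    """Hoeffding bound for the mean of n variables, each in an interval of length `width`."""
    return math.exp(-2.0 * n * eps ** 2 / width ** 2)

n, eps = 1000, 0.05
alpha = 1.0        # |Z_i| <= alpha almost surely
small_var = 0.01   # per-variable variance, chosen below eps
b = bernstein_tail(eps, n, small_var, alpha)
h = hoeffding_tail(eps, n, 2.0 * alpha)   # Z_i takes values in [-alpha, alpha]
```

With these values the Bernstein exponent is far larger than the Hoeffding exponent, which is the small-variance advantage referred to in point (2) above.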


3 MOTIVATIONS AND MAIN RESULTS

In this section, we motivate this work by pointing out the heterogeneity
challenge and how we approach it. Our bounds also improve the functional form
of the exponent, which is of independent significance. In the final part of
this section, we summarize our main results.

3.1 The Challenge and the Remedy

McAllester and Ortiz (2003) point out that for highly heterogeneous sums of the
form (3), the standard form of Bernstein's inequality (9) does not lead to
concentration inequalities of form (10): at least for the upper deviation of
the missing mass, (9) does not imply any non-trivial bounds of this form. The
reason is basically the fact that the $w_i$ can vary wildly: some can be of
order $O(1/n)$, others may be constants independent of $n$. For similar
reasons, other standard inequalities such as Bennett, Angluin-Valiant and
Hoeffding cannot be used to get bounds of this form on the missing mass either
(McAllester and Ortiz (2003)).

Having pointed out the deficiency of these standard inequalities, McAllester
and Ortiz (2003) succeed in giving bounds of the form (2) on the missing mass,
for a function $\eta(\epsilon, n) \propto n\epsilon^2$, both with a direct
argument and using the Kearns-Saul inequality (Kearns and Saul (1998)).
Recently, the constants appearing in the bounds were refined by Berend and
Kontorovich (2013). The bounds proven by McAllester and Ortiz (2003) and Berend
and Kontorovich (2013) are qualitatively similar to Hoeffding bounds for i.i.d.
random variables: they do not improve the functional form from $n\epsilon^2$ to
$n\epsilon$ for small variances.

This leaves open the question whether it is also possible to derive bounds
which are more reminiscent of the Bernstein bound for i.i.d. random variables
(10), which does exploit variance. In this paper, we show that the answer is a
qualified yes: we give bounds that depend on the weighted variance $\sigma^2$
defined in (7) rather than the sample variance $\bar{\sigma}^2$ as in (10),
which is tight exactly in the important case when $\bar{\sigma}^2$ is small,
and in which the denominator in (10) is specified by a factor depending on
$\epsilon$; in the special case of the missing mass, this factor turns out to
be logarithmic in $\epsilon$ and a free parameter $\gamma$, as will become
clear later.

We derive - using Bernstein's inequality - novel bounds on the missing mass
that take into account explicit variance information with more accurate scaling
and demonstrate their superiority for small deviations.

3.2 Main Results

Consider the following functions

$$\gamma_\epsilon = -2\, W_{-1}\left(-\frac{\epsilon}{2\sqrt{e}}\right), \tag{11}$$

$$c(\epsilon) = \frac{3(\gamma_\epsilon - 1)}{5\gamma_\epsilon^2}. \tag{12}$$

Let $Y$ denote the missing mass, $n$ the sample size and $\epsilon$ the
deviation size.

Theorem 1. For any $0 < \epsilon < 1$ and any
$n \ge \lceil \gamma_\epsilon \rceil - 1$, we obtain the following upper
deviation bound

$$P(Y - E[Y] \ge \epsilon) \le e^{-c(\epsilon) \cdot n\epsilon}. \tag{13}$$

Theorem 2. For any $0 < \epsilon < 1$ and any
$n \ge \lceil \gamma_\epsilon \rceil - 1$, we obtain the following lower
deviation bound

$$P(Y - E[Y] \le -\epsilon) \le e^{-c(\epsilon) \cdot n\epsilon}. \tag{14}$$

Corollary 1. For any $0 < \epsilon < 1$ and any
$n \ge \lceil \gamma_\epsilon \rceil - 1$, using the union bound we obtain the
following deviation bound

$$P(|Y - E[Y]| \ge \epsilon) \le 2 e^{-c(\epsilon) \cdot n\epsilon}. \tag{15}$$

The proof of the above theorems is provided in Section 5. However, let us first
develop a few tools in Section 4 which will be used later in our proofs.

4 NEGATIVE DEPENDENCE AND INFORMATION MONOTONICITY

Probabilistic analysis of most random variables, and specifically the
derivation of the majority of probabilistic bounds, relies on an independence
assumption between variables, which offers considerable simplification and
convenience. Many random variables including the missing mass, however, consist
of random components that are not independent.


Fortunately, even in cases where independence does not 
hold, one can still use some standard tools and methods 
provided variables are dependent in specific ways. The 
following notions of dependence are among the common 
ways that prove useful in these settings: negative 
association and negative regression. 

































4.1 Negative Dependence and Chernoff's Exponential Moment Method

Our proof involves variables with a specific type of dependence known as
negative association. One can infer concentration of sums of negatively
associated random variables from the concentration of sums of their independent
copies in certain situations. In the exponential moment method, this property
allows us to treat such variables as independent in the context of probability
inequalities, as we shall elaborate later in this section.

In the sequel, we present negative association and regression and supply tools
that will be essential in the proofs.

Negative Association: Any real-valued random variables $X_1$ and $X_2$ are
negatively associated if

$$E[X_1 X_2] \le E[X_1] \cdot E[X_2]. \tag{16}$$

More generally, a set of random variables $X_1, \dots, X_m$ are negatively
associated if for any disjoint subsets $A$ and $B$ of the index set
$\{1, \dots, m\}$, we have

$$E[X_i X_j] \le E[X_i] \cdot E[X_j] \quad \text{for } i \in A,\ j \in B. \tag{17}$$

Stochastic Domination: Assume that $X$ and $Y$ are real-valued random
variables. Then, $X$ is said to stochastically dominate $Y$ if for all $a$ in
the range of $X$ and $Y$ we have

$$P(X \ge a) \ge P(Y \ge a). \tag{18}$$

We use the notation $X \succeq Y$ to reflect (18) in short.

Stochastic Monotonicity: A random variable $Y$ is stochastically non-decreasing
in random variable $X$ if

$$x_1 \le x_2 \implies P(Y \mid X = x_1) \le P(Y \mid X = x_2). \tag{19}$$

Similarly, $Y$ is stochastically non-increasing in $X$ if

$$x_1 \le x_2 \implies P(Y \mid X = x_1) \ge P(Y \mid X = x_2). \tag{20}$$

The notations $(Y \mid X = x_1) \preceq (Y \mid X = x_2)$ and
$(Y \mid X = x_1) \succeq (Y \mid X = x_2)$ represent the above definitions
using the notion of stochastic domination. Also, we will use the shorthands
$Y \uparrow X$ and $Y \downarrow X$ to refer to the relations described by (19)
and (20) respectively.

Negative Regression: Random variables $X$ and $Y$ have a negative regression
dependence relation if $X \downarrow Y$.

Dubhashi and Ranjan (1998) as well as Joag-Dev and Proschan (1983) summarize
numerous notable properties of negative association and negative regression.
Specifically, the former provides a proposition that indicates that
Hoeffding-Chernoff bounds apply to sums of negatively associated random
variables. Further, McAllester and Ortiz (2003) generalize these observations
to essentially any concentration result derived based on the exponential moment
method by drawing a connection between the deviation probability of a discrete
random variable and Chernoff's entropy of a related distribution.

We provide a self-standing account by presenting the proof for some of these
existing results as well as developing several generic tools that are
applicable beyond the missing mass problem.

Lemma 1. [Binary Stochastic Monotonicity] Let $Y$ be a binary random variable
(Bernoulli) and let $X$ take on values in a totally ordered set $\mathcal{X}$.
Then, one has

$$Y \downarrow X \implies X \downarrow Y. \tag{21}$$

Proof. For any $x$, we have

$$
\begin{aligned}
P(Y = 1 \mid X \le x) &\ge \inf_{a \le x} P(Y = 1 \mid X = a) \\
&\ge \sup_{a > x} P(Y = 1 \mid X = a) \\
&\ge P(Y = 1 \mid X > x).
\end{aligned} \tag{22}
$$

The above argument implies that the random variables $Y$ and
$\mathbb{1}_{X > x}$ are negatively associated, and since the expression
$P(X > x \mid Y = 1) \le P(X > x \mid Y = 0)$ holds for all
$x \in \mathcal{X}$, it follows that $X \downarrow Y$. □

Lemma 2. [Independent Binary Negative Regression] Let $X_1, \dots, X_m$ be
negatively associated random variables and $Y_1, \dots, Y_m$ be binary random
variables (Bernoulli) such that either $Y_i \downarrow X_i$ or
$Y_i \uparrow X_i$ holds for all $i \in \{1, \dots, m\}$. Then
$Y_1, \dots, Y_m$ are negatively associated.

Proof. For any disjoint subsets $A$ and $B$ of $\{1, \dots, m\}$, taking
$i \in A$ and $j \in B$ we have

$$E[Y_i Y_j] = E\big[E[Y_i Y_j \mid X_1, \dots, X_m]\big] \tag{23}$$

$$= E\big[E[Y_i \mid X_i] \cdot E[Y_j \mid X_j]\big] \tag{24}$$

$$\le E\big[E[Y_i \mid X_i]\big] \cdot E\big[E[Y_j \mid X_j]\big] \tag{25}$$

$$= E[Y_i] \cdot E[Y_j]. \tag{26}$$

Here, (24) holds since each $Y_i$ only depends on $X_i$. Inequality (25)
follows because $X_i$ and $X_j$ are negatively associated and we have
$E[Y_i \mid X_i] = P(Y_i = 1 \mid X_i)$. □


Lemma 3. [Chernoff] For any real-valued random variable $X$ with finite mean
$E[X]$ and for any $x > 0$, we have:

$$DP(X, x) \le \exp(-S(X, x)), \tag{27}$$

$$S(X, x) = \sup_{\lambda} \{\lambda x - \ln(Z(X, \lambda))\}, \tag{28}$$

$$Z(X, \lambda) = E[e^{\lambda X}]. \tag{29}$$

The lemma follows from the observation that for $\lambda \ge 0$, we have the
following

$$P(X \ge x) = P(e^{\lambda X} \ge e^{\lambda x}) \le \inf_{\lambda} \frac{E[e^{\lambda X}]}{e^{\lambda x}}. \tag{30}$$

This approach is known as the exponential moment method (Chernoff (1952))
because of the inequality in (30).
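
The quantities in (27)-(29) can be evaluated numerically for a simple case. The sketch below is only an illustration under assumptions chosen for this example: $X$ is taken to be Binomial$(n, p)$, for which $\ln Z(X, \lambda)$ has a closed form, and the supremum in (28) is found by a ternary search over $\lambda \in [0, 50]$, which works because the objective is concave in $\lambda$. The function names are hypothetical.

```python
import math

def log_mgf_binomial(lam, n, p):
    """ln Z(X, lam) = ln E[exp(lam * X)] for X ~ Binomial(n, p)."""
    return n * math.log(1.0 - p + p * math.exp(lam))

def chernoff_entropy(x, n, p, lam_max=50.0, iters=200):
    """S(X, x) = sup_{lam >= 0} (lam * x - ln Z(X, lam)) via ternary search."""
    obj = lambda lam: lam * x - log_mgf_binomial(lam, n, p)
    lo, hi = 0.0, lam_max
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if obj(m1) < obj(m2):
            lo = m1
        else:
            hi = m2
    return obj(0.5 * (lo + hi))

def exact_upper_tail(x, n, p):
    """P(X >= x) summed directly from the binomial probability mass function."""
    return sum(math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)
               for k in range(math.ceil(x), n + 1))

n, p, x = 100, 0.1, 20
S = chernoff_entropy(x, n, p)
bound = math.exp(-S)          # right-hand side of (27)
tail = exact_upper_tail(x, n, p)
```

For the binomial case the supremum equals $n$ times the Kullback-Leibler divergence between Bernoulli$(x/n)$ and Bernoulli$(p)$, which gives an independent check on the search.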





















Lemma 4. [Negative Association] In the exponential moment method, concentration
of sums of negatively associated random variables can be deduced from the
concentration of sums of their independent copies.

Proof. Let $X_1, \dots, X_m$ be any set of negatively associated variables. Let
$X'_1, \dots, X'_m$ be independent shadow variables, i.e., independent
variables such that each $X'_i$ is distributed identically to $X_i$. Let
$X = \sum_{i=1}^{m} X_i$ and $X' = \sum_{i=1}^{m} X'_i$. For any set of
negatively associated random variables, one has
$S(X, \epsilon) \ge S(X', \epsilon)$ since:

$$Z(X, \lambda) = E[e^{\lambda X}] = E\left[\prod_{i}^{m} e^{\lambda X_i}\right] \le \prod_{i}^{m} E[e^{\lambda X_i}] = E[e^{\lambda X'}] = Z(X', \lambda). \tag{31}$$

The lemma is due to McAllester and Ortiz (2003) and follows from the definition
of the entropy function $S$ given by (28). □

This lemma is very helpful in the context of large deviation bounds: it implies
that one can treat negatively associated variables as if they were independent
(McAllester and Ortiz (2003); Dubhashi and Ranjan (1998)).

Lemma 5. [Balls and Bins] Let $S$ be any sample comprising $n$ items drawn
i.i.d. from a fixed distribution on integers $\mathcal{N} = \{1, \dots, N\}$
(bins). Define $C_i$ to be the number of times that integer $i$ occurs in $S$.
The random variables $C_1, \dots, C_N$ are negatively associated.

Proof. Let $f$ and $g$ be non-decreasing and non-increasing functions
respectively. We have

$$(f(x) - f(y))(g(x) - g(y)) \le 0. \tag{32}$$

Further, assume that $X$ is a real-valued random variable and $Y$ is an
independent shadow variable corresponding to $X$. Exploiting (32) with $x = X$
and $y = Y$ and taking expectations, we obtain

$$E[f(X) g(X)] \le E[f(X)] \cdot E[g(X)], \tag{33}$$

which implies that $f(X)$ and $g(X)$ are negatively associated. Inequality (33)
is an instance of Chebychev's fundamental association inequality.

Now, suppose without loss of generality that $N = 2$. Take $X \in [0, n]$, and
consider the following functions

$$
\begin{cases}
f(X) = X, \\
g(X) = n - X,
\end{cases} \tag{34}
$$

where $n = C_i + C_j$ is the total count. Since $f$ and $g$ are non-decreasing
and non-increasing functions of $X$, choosing $X = f(C_i) = C_i$ we have for
all $i, j \in \mathcal{N}$ that

$$E[C_i \cdot C_j] \le E[C_i] \cdot E[C_j], \tag{35}$$

which concludes the proof for $N = 2$. Now, taking $f(C_i) = C_i$ and
$g(C_i) = n - \sum_{j \ne i} C_j$ where $n = \sum_{k=1}^{N} C_k$, for $N > 2$
the same argument implies that $C_i$ and $C_j$ are negatively associated for
all $i \in \mathcal{N}$ and $j \in \mathcal{N} \setminus i$. That is to say,
any increase in $C_i$ will cause a decrease in some or all of the $C_j$
variables with $j \ne i$ and vice versa. It is easy to verify that the same is
true for any disjoint subsets of the set $\{C_1, \dots, C_N\}$. □

Lemma 6. [Monotonicity] For any negatively associated random variables
$X_1, \dots, X_m$ and any non-decreasing functions $f_1, \dots, f_m$, we have
that $f_1(X_1), \dots, f_m(X_m)$ are negatively associated. The same holds if
the functions $f_1, \dots, f_m$ were non-increasing.

Remark: The proof is in the same spirit as that of the association inequality
(33) and is motivated by composition rules for monotonic functions that one can
repeatedly apply to (32).

Lemma 7. [Union] The union of independent sets of negatively associated random
variables yields a set of negatively associated random variables.

Suppose that $X$ and $Y$ are independent vectors, each of which comprises a
negatively associated set. Then, the concatenated vector $[X, Y]$ is negatively
associated.

Proof. Let $[X_1, X_2]$ and $[Y_1, Y_2]$ be some arbitrary partitions of $X$
and $Y$ respectively and assume that $f$ and $g$ are non-decreasing functions.
Then, one has

$$
\begin{aligned}
E[f(X_1, Y_1) \cdot g(X_2, Y_2)]
&= E\big[E[f(X_1, Y_1) \cdot g(X_2, Y_2) \mid Y_1, Y_2]\big] \\
&\le E\big[E[f(X_1, Y_1) \mid Y_1] \cdot E[g(X_2, Y_2) \mid Y_2]\big] \\
&\le E\big[E[f(X_1, Y_1) \mid Y_1]\big] \cdot E\big[E[g(X_2, Y_2) \mid Y_2]\big] \\
&= E[f(X_1, Y_1)] \cdot E[g(X_2, Y_2)].
\end{aligned} \tag{36}
$$

The first inequality is due to the independence of $[X_1, X_2]$ from
$[Y_1, Y_2]$, which results in negative association being preserved under
conditioning, and the second inequality follows because $[Y_1, Y_2]$ are
negatively associated (Joag-Dev and Proschan (1983)). The same holds if $f$ and
$g$ were non-increasing functions. □

Lemma 8. [Splitting] Splitting an arbitrary subset of bins of any fixed
discrete distribution yields a set of negatively associated random bins.

Proof. Let $w = (w_1, \dots, w_m)$ be a discrete distribution and
$W = \{W_1, \dots, W_m\}$ be the associated set of random bins. Assume that
$W_i$ is split into $k$ bins $W_i^{s} = \{W_{i1}, \dots, W_{ik}\}$ such that
$W_i = \sum_{j=1}^{k} W_{ij}$. Then, by Lemma 5, members of the split set
$W_i^{s}$ are negatively associated. Clearly, the same holds for all
$1 \le i \le m$ as well as any other subset of the set $W$. Moreover, for all
$1 \le i \le m$ the sets $W_i^{s}$ and $W \setminus W_i$ are negatively
associated by Lemma 5 and Lemma 7. □
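
The balls-and-bins statement of Lemma 5 and inequality (35) can be checked empirically. The following sketch, with a hypothetical three-bin distribution and helper names chosen for this example, estimates the covariance of two bin counts by simulation; for multinomial counts the exact covariance is $-n p_i p_j$, which is used only as a reference value.

```python
import random

def sample_counts(p, n, rng):
    """Throw n balls into len(p) bins with probabilities p and return the counts."""
    counts = [0] * len(p)
    for i in rng.choices(range(len(p)), weights=p, k=n):
        counts[i] += 1
    return counts

def empirical_covariance(p, n, i, j, trials=10000, seed=1):
    """Estimate E[C_i C_j] - E[C_i] E[C_j] from repeated samples."""
    rng = random.Random(seed)
    s_i = s_j = s_ij = 0.0
    for _ in range(trials):
        c = sample_counts(p, n, rng)
        s_i += c[i]
        s_j += c[j]
        s_ij += c[i] * c[j]
    return s_ij / trials - (s_i / trials) * (s_j / trials)

p = [0.5, 0.3, 0.2]
n = 20
cov = empirical_covariance(p, n, 0, 1)
exact = -n * p[0] * p[1]   # multinomial covariance, for reference
```

A negative estimate is what (35) predicts: with a fixed total count, a larger $C_i$ leaves fewer balls for $C_j$.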

























Lemma 9. [Absorption] Absorbing any subset of bins of a discrete distribution
yields negatively associated bins.

Proof. Let $w = (w_1, \dots, w_N)$ be a discrete distribution and let
$W = \{W_1, \dots, W_N\}$ be the associated set of random bins. Assume without
loss of generality that $W^{A} = \{W_1^{A}, \dots, W_{N-1}^{A}\}$ is the
absorption-induced set of random bins where $w_N$ is absorbed to produce
$w^{A} = (w_1^{A}, \dots, w_{N-1}^{A})$ and where each $w_i^{A}$ equals $w_i$
plus the portion of $w_N$ absorbed by bin $i$, for $i = 1, \dots, N - 1$. So,
$w_N$ is discarded and the remaining original bins carry total mass
$1 - w_N$. The rest of the proof concerns applying Lemma 5 to the absorbed set
$W^{A}$. The same holds if we absorb $w_N$ into a subset of
$W \setminus W_N$. □

4.2 Negative Dependence and the Missing Mass

For the missing mass, the variables $W_i = \frac{C_i}{n}$ are negatively
associated owing to Lemma 5 and linearity of expectation. Also, one has
$\forall i : Y_i \downarrow W_i$. So, by Lemma 1 we infer that
$\forall i : W_i \downarrow Y_i$. Now, $Y_1, \dots, Y_N$ are negatively
associated because they are a set of independent binary variables with negative
regression dependence (Lemma 2). Thus, the concentration variables
$Z_i = w_i Y_i - E[w_i Y_i] := \zeta(Y_i)$ are negatively associated by Lemma 6
since we have

$$
\zeta(Y_i) =
\begin{cases}
-w_i q_i & \text{if } Y_i = 0, \\
w_i (1 - q_i) & \text{if } Y_i = 1.
\end{cases} \tag{37}
$$

For all $i$, $\zeta$ is a non-decreasing function of $Y_i$. Likewise, the
concentration variables $-Z_i$ are negatively associated.

4.3 Information Monotonicity and Partitioning

Lemma 10. [Information Monotonicity] Let $p = (p_1, \dots, p_N)$ be a discrete
distribution on $\mathcal{X} = (x_1, \dots, x_N)$ such that for
$1 \le i \le N$ we have $P(X = x_i) = p_i$. Suppose we partition $\mathcal{X}$
into $m \le N$ non-empty disjoint groups $G_1, \dots, G_m$, namely

$$
\begin{aligned}
\mathcal{X} &= \bigcup_{i} G_i, \\
\forall i \ne j &: G_i \cap G_j = \emptyset.
\end{aligned} \tag{38}
$$

This is called coarse binning since it generates a new distribution with groups
$G_i$ whose dimensionality is less than that of the original distribution. Note
that once the distribution is transformed, considering any outcome $x_i$ from
the original distribution we will only have access to its group membership
information; for instance, we can observe that it belongs to $G_j$ but we will
not be able to recover $p_i$.

Let us denote the induced distribution over the partition
$G = (G_1, \dots, G_m)$ by $p^{G} = (p_1^{G}, \dots, p_m^{G})$. Clearly, we
have

$$p_i^{G} = P(G_i) = \sum_{j \in G_i} P(x_j). \tag{39}$$

Now, consider the $f$-divergence $D_f(p^{G} \,\|\, q^{G})$ between the induced
probability distributions $p^{G}$ and $q^{G}$. Information monotonicity states
that information is lost as we partition elements of $p$ and $q$ into groups to
produce $p^{G}$ and $q^{G}$ respectively. Namely, for any $f$-divergence one
has

$$D_f(p^{G} \,\|\, q^{G}) \le D_f(p \,\|\, q), \tag{40}$$

which is due to Csiszár (Csiszár (1922, 2008); Amari (2009)). This inequality
is tight if and only if for any outcome $x_i$ and partition $G_j$, we have
$p(x_i \mid G_j) = q(x_i \mid G_j)$.

Lemma 11. [Partitioning] In the exponential moment method, one can establish a
deviation bound for any discrete random variable $X$ by invoking Chernoff's
method on the associated discrete partition random variable $X^{G}$.

Formally, assume $X$ and $X_\lambda$ are discrete random variables defined on
the set $\mathcal{X}$ endowed with probability distributions $p$ and
$p_\lambda$ respectively. Further, suppose that $X^{G}$ and $X_\lambda^{G}$ are
discrete variables on a partition set $\mathcal{X}^{G}$ endowed with $p^{G}$
and $p_\lambda^{G}$ that are obtained from $p$ and $p_\lambda$ by partitioning
using some partition $G$. Then, we have

$$\forall x > 0 : \quad DP(X, x) \le \exp(-S(X^{G}, x)). \tag{41}$$

Proof. Let $\lambda(x)$ be the optimal $\lambda$ in (28). Then, we have

$$
\begin{aligned}
S(X, x) &= x\lambda(x) - \ln(Z(X, \lambda(x))) \\
&= D_{KL}(p_{\lambda(x)} \,\|\, p) \\
&\ge D_{KL}(p_{\lambda(x)}^{G} \,\|\, p^{G}) \\
&= S(X^{G}, x),
\end{aligned} \tag{42}
$$

where we have introduced the $\lambda$-induced distribution

$$p_\lambda(X = x) = \frac{e^{\lambda x}}{Z(X, \lambda)} P(X = x). \tag{43}$$

The inequality step in (42) follows from (40) and the observation that $D_{KL}$
is an instance of $f$-divergence where $f(v) = v \ln(v)$ with $v \ge 0$. □
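
Information monotonicity for the Kullback-Leibler divergence, the instance of (40) used in the proof of Lemma 11, can be illustrated with a short computation. The distributions `p` and `q` and the partition `groups` below are hypothetical values chosen only for this sketch.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence D_KL(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def coarse_bin(p, groups):
    """Induced distribution over a partition: p^G_i is the sum of p_j for j in G_i."""
    return [sum(p[j] for j in g) for g in groups]

p = [0.4, 0.3, 0.2, 0.1]
q = [0.25, 0.25, 0.25, 0.25]
groups = [[0, 1], [2, 3]]      # a partition of the four outcomes into two groups

d_fine = kl(p, q)
d_coarse = kl(coarse_bin(p, groups), coarse_bin(q, groups))
```

Merging outcomes discards the information about how mass is spread inside each group, so the divergence after coarse binning cannot exceed the original one.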


5 PROOF OF THE MAIN RESULTS

The central idea of the proof is to regulate the terms in the sum given by (3)
via controlling the magnitude of bins of the distribution using operations that
preserve negative association. This mechanism will help defeat the
heterogeneity issue leading to the failure of standard probability inequalities
described by McAllester and Ortiz (2003).

5.1 Proof of Theorem 1: Upper Deviation Bound

We consider the thresholds $\tau = \frac{\theta}{n}$ and
$\tau' = \frac{2\theta}{n}$ and reduce the problem to one in which all bins
that are larger than $\tau$ are eliminated, where $\theta \in D_f$ will depend
on the target deviation size $\epsilon$.

The reduction is performed by splitting the bins that are larger than $\tau$
and then absorbing the bins that are smaller than $\tau$. This is followed by
choosing a threshold that yields the sharpest bound for the choice of
$\epsilon$. It turns out that the optimal threshold will also be a function of
$\epsilon$.

Let $\mathcal{I}_\tau \subseteq \mathcal{I}$ denote the subset of bins that are
at most as large as $\tau$, $\mathcal{I}_\theta$ the subset of bins whose
magnitude is between $\tau$ and $\tau'$, $\mathcal{I}_{\tau'}$ the subset of
bins larger than $\tau'$, and $\mathcal{I}'_\theta$ and $\mathcal{I}'_{\tau'}$
the sets of bins that we obtain after splitting members of $\mathcal{I}_\theta$
and $\mathcal{I}_{\tau'}$ respectively.

Now, for each
$i \in \mathcal{I} \setminus \mathcal{I}_\tau = \{\mathcal{I}_\theta \cup \mathcal{I}_{\tau'}\}$
and for some $k \in \mathbb{N}$ that depends on $i$ (but we suppress that
notation below), we will have that $k \cdot \tau \le w_i < (k+1) \cdot \tau$.
For all such $i$, we define extra independent Bernoulli random variables
$Y_{ij}$ with $j \in \mathcal{J}_i := \{1, \dots, k\}$ and their associated
bins $w_{ij}$. For $j \in \{1, \dots, k-1\}$, $w_{ij} = \tau$ and
$w_{ik} = w_i - (k-1) \cdot \tau$.

In this way, all bins that are larger than $\tau$ are split up into $k$ bins,
each of which is in-between $\tau$ and $\tau'$; more precisely, the first
$k - 1$ are exactly $\tau$ and the last one may be larger. Therefore, we
consider the split random variable

$$Y' = \sum_{i \in \mathcal{I}_\tau} w_i Y_i + \sum_{i \in \{\mathcal{I}_\theta \cup \mathcal{I}_{\tau'}\}} \sum_{j \in \mathcal{J}_i} w_{ij} Y_{ij}$$

and the set
$\mathcal{U}' = \{i \mid w_i < \tau'\} = \{\mathcal{I}_\tau \cup \mathcal{I}'_\theta \cup \mathcal{I}'_{\tau'}\}$.
Furthermore, we introduce the random variable
$Y'' = \sum_{i \in \mathcal{U}''} w_i Y_i$ on the absorption-induced set
$\mathcal{U}'' = \{i \mid \tau \le w_i < \tau'\}$. The set $\mathcal{U}''$ is
generated from $\mathcal{U}'$ as follows: we take the largest element
$j \in \mathcal{U}'$ with $w_j < \tau$, update $w_l$ for every
$l \in \{l \in \mathcal{U}' : l \ne j,\ w_l < \tau\}$ by adding to it an equal
share of $w_j$, and discard $w_j$. Repeating this procedure gives a set of bins
whose sizes are in-between $\tau$ and $\tau'$ plus a single bin of size smaller
than $\tau$; absorbing the latter into one of the members of the former with
size $\tau$ yields $\mathcal{U}''$.
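
The splitting step described above can be sketched in a few lines. The code is a minimal illustration under stated assumptions: bins are plain floats, `tau` is the threshold $\tau$, and a bin with $k\tau \le w_i < (k+1)\tau$ is replaced by $k - 1$ pieces of size $\tau$ and one piece of size $w_i - (k-1)\tau$. The function name and the example distribution are hypothetical, and the absorption step is not covered.

```python
def split_bins(w, tau):
    """Split each bin w_i with k*tau <= w_i < (k+1)*tau, k >= 2, into k pieces.

    The first k-1 pieces have size tau and the last has size w_i - (k-1)*tau,
    so every piece produced by a split lies in [tau, 2*tau).
    Bins smaller than 2*tau (k <= 1) are returned unchanged.
    """
    out = []
    for wi in w:
        k = int(wi // tau)
        if k <= 1:
            out.append(wi)
        else:
            out.extend([tau] * (k - 1))
            out.append(wi - (k - 1) * tau)
    return out

# Hypothetical example: two heavy bins are split, the rest are left alone.
w = [0.45, 0.27, 0.18, 0.06, 0.04]
tau = 0.1
pieces = split_bins(w, tau)
```

Since $(1 - a)(1 - b) \ge 1 - (a + b)$ for non-negative $a$ and $b$, the product of the probabilities $(1 - w_{ij})^n$ over the pieces of a bin is at least $(1 - w_i)^n$, which is the condition verified in (56).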

Now, by choosing 9 such that f{9) = e~ e = — and 
9 = / _1 (^) = ln(7) for any 0 < e < 1 and ee < 7 < e n e 
as generic domain for 7 , we derive the upper deviation 
bound for missing mass as follows 


Clearly, we will have that r* = where 9* = ln(^). 

Inequality (1451) follows because the splitting procedure can¬ 
not decrease deviation probability of missing mass. 

Formally, assume without loss of generality that X \I T 
has only one element corresponding to Yi, J\ = { 1 , 2 } 
and ki = 1 i.e. w\ is split into two parts. Then, 
deviation probability of Y can be thought of as the total 
probability mass associated to independent Bernoulli 
variables Y \,.... Yv whose weighted sum is bounded 
below by some tail t > 0. Hence, we have 

P(Y >t) = E P(Y u ...,Y n ) 

Y 1N ; Y>t 

N 

= e ^)-n^) 

Y 1N -, Y>t i=2 

N 

+ E myjlRK) 

Y 1N ; Y<t ; Y>t i=2 

N 

= e ^)-n^) 

Y 1N -, Y>t i=2 

N 

+ E wyjlw) 

Y 1N ; Y<t ; Y>t , Yi = l i=2 

N 

= e 

Y 2N - Y>t i=2 

N 

+ e *-n R(Yi), (53) 

Y 2N ; Y <t; Y>t i=2 

where Y = Yi >2 w iYi an d R(Xi) = ft if Y, = 1 and 
R(Yi) = 1 — (ji otherwise. Likewise, one can express the 
upper deviation probability of Y' as follows 


P(Y - E[Y] > e) < 

P(Y' - E[Y] > e) = 

P(Y' - E[Y'] + (E[Y'] - E[Y]) > e) < 
P(Y' - E[Y'] + f{9) > e) = 


°(y'-E[Y'] > (-—-)e) 


exp - 


7 




< 


exp 



inf 

7 



3 ne(7- 1) 2 \ t 
IO 7 2 ln( 4 ) J I 


e -c(e)-nc 


< 


(44) 

(45) 

(46) 

(47) 

(48) 

(49) 


(50) 

(51) 

(52) 


P(Y'>f)= £ RiYj flRiY) 

Y 1N - Y>t i=2 

N 

+ ^ (i?(Y n )-R(Y 12 ))nW 

Yll,Yl2,Y 2N Y<t-, Y'>t i=2 

N 

= e a*™ 

y 2N ; Y>t *=2 


+ ^ (R(Yn).R(Y 12 ))nTO 

Y\i,Y 12 ,Y 2N ; L<t; Y’>t *= 2 

> E f[w) 

Y 2N ; Y>t i=2 


N 

E (911-912)11^). 

Y 2N ] Y<t ; Y’>t i=2 


+ 


(54) 







where R(Yij) = q l:j if Y tj = 1 and R(Y zj ) = 1 - q %3 

otherwise. Thus, combining (l53l ) and (l54l > we have 

P (Y' >t)~ P(Y >t)> 

N 

E (?n • 912 — qi ) R ( Yi ) = 

Y 2 N. y <£; Y’>t; Y>t *= 2 

N 

e (911 ■ 912 - 01) n (55) 

Y 2N - Y <t; Y’>t i=2 

To complete the proof for (l45l >. we require the expression 
for the difference between deviation probabilities in (l55T > to 
be non-negative for all t > 0 which holds if qi < qn ■ q\ 2 - 
For the missing mass, this condition holds. Without loss of 
generality, assume that Wi is split into two terms; namely, 
we have in, = Wij + w-ij>. Then, we can check the above 
condition as follows 


and g(x,n) = £ 2 (1 — x) n (l — (1 — x) n ) are non-increasing 
with respect to x on (q-q-j-, 1) and (q-qqj, 1) respectively. We 
obtain for 1 < 9 < n, an upperbound on Vu" as follows: 


V U " = E Wi(l-Wi) n (l-(1-Wi) n ^ 

i: Wi^lA" 

< T E Wi{l - Wi) n ^1 - (1 - Wi) n ^j 

i: Wi€U" 

= T-qh" 

<t E Wi(i-Wi) n 

i : T<Wi<r '; Wj =1 



) 



n 


e 

< — ■ e. 

n 


(59) 


qi = (1 - Wi) n < (1 - mj) n ■ (1 - Wij’Y 

= (l - {wij + Wij') + ) . (56) 

Wi >0 

One can verify using induction that (56) holds also for
cases where the split operation produces more than two
terms. Now, choosing tail size $t = \epsilon + E[Y]$ implies (45).
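The splitting condition in (56) can be checked numerically. The following is a minimal sketch, assuming $q = (1 - w)^n$ as in (56); the helper names `unseen_prob` and `split_keeps_condition` are hypothetical and introduced only for illustration.

```python
def unseen_prob(w: float, n: int) -> float:
    """q = (1 - w)^n: chance that a bin of mass w is absent from n i.i.d. draws."""
    return (1.0 - w) ** n


def split_keeps_condition(w: float, pieces: list[float], n: int) -> bool:
    """Check q_i <= prod_j q_ij when the mass w is split into `pieces`."""
    if abs(sum(pieces) - w) > 1e-12:
        raise ValueError("pieces must sum to w")
    product = 1.0
    for piece in pieces:
        product *= unseen_prob(piece, n)
    # The small slack only guards against floating-point round-off.
    return unseen_prob(w, n) <= product + 1e-15


# A two-way split and a three-way split of the same bin.
print(split_keeps_condition(0.3, [0.1, 0.2], n=50))
print(split_keeps_condition(0.3, [0.1, 0.1, 0.1], n=50))
```

The check succeeds for any split into non-negative pieces because the product $(1 - w_{ij})(1 - w_{ij'})$ exceeds $1 - w_i$ by the cross term $w_{ij} w_{ij'}$.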

Inequality (47) follows because the gap between the expectations
will be negligible. Denoting $E[Y_i'] = q_i'$, we have


In order to see why (52) holds, consider $c(\gamma, \epsilon) = \frac{3(\gamma - 1)^2}{10 \gamma^2 \ln(\gamma / \epsilon)}$
and let us examine the derivatives as follows


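\begin{align}
\frac{\partial c(\gamma, \epsilon)}{\partial \gamma} &= -\frac{3 (\gamma - 1)\big(\gamma - 1 - 2 \ln(\gamma / \epsilon)\big)}{10 \gamma^3 \ln^2(\gamma / \epsilon)}, \tag{60}\\
\frac{\partial^2 c(\gamma, \epsilon)}{\partial \gamma^2} &= \frac{3}{10 \gamma^4 \ln^3(\gamma / \epsilon)} \Big[ (6 - 4\gamma) \ln^2\Big(\frac{\gamma}{\epsilon}\Big) + (\gamma^2 - 6\gamma + 5) \ln\Big(\frac{\gamma}{\epsilon}\Big) + 2(\gamma - 1)^2 \Big]. \tag{61}
\end{align}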


{ qi if t . 

Qij if i e {T t , UT'}, (57) 

0 otherwise. 

Namely, we can write 
g u (0) = E[Y'] - E[Y] = E Wj(q' - Qi ) 

iei 


= E mqi + E E Wi i qi i " 

- E WiQi 

iex T 

ie{x},uz'} j&Ji 

tex 

= E 

E "w ~ E 

Wiq t 

ie{Z',UX'} jeJi i£{Z T ,UT e } 


< E 

E w v qi i 


ie{z;,uz'} jeJi 


< E 

E < m- 

( 58 ) 


jeji 


The expression in (49) is Bernstein’s inequality applied
to the random variable $Z_u = \sum_{i \in U''} Z_i$, relying upon
the lemma stated earlier. Here, the concentration variables are
$Z_i = w_i Y_i - E[w_i Y_i]$ with $i \in U''$ and we set $\alpha_u = \tau'$.
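To make the role of the two key quantities concrete, the following minimal sketch evaluates the standard Bernstein tail $\exp\big(-t^2 / (2(V + \alpha t / 3))\big)$. It assumes that the variance proxy is the sum of the variances $w_i^2 q_i (1 - q_i)$ with $q_i = (1 - w_i)^n$; the function names are hypothetical, and the code only evaluates the formula without checking the conditions under which the inequality applies.

```python
import math


def variance_proxy(weights: list[float], n: int) -> float:
    """Sum of Var(w_i * Y_i) = w_i^2 * q_i * (1 - q_i), with q_i = (1 - w_i)^n."""
    total = 0.0
    for w in weights:
        q = (1.0 - w) ** n
        total += w * w * q * (1.0 - q)
    return total


def bernstein_upper_tail(v: float, alpha: float, t: float) -> float:
    """Bernstein tail exp(-t^2 / (2 (V + alpha * t / 3))) for a deviation t > 0."""
    if t <= 0.0 or v < 0.0 or alpha <= 0.0:
        raise ValueError("need t > 0, v >= 0 and alpha > 0")
    return math.exp(-t * t / (2.0 * (v + alpha * t / 3.0)))


# Hypothetical example: 100 equal bins, sample size 200, deviation 0.05.
weights = [0.01] * 100
v = variance_proxy(weights, n=200)
print(bernstein_upper_tail(v, alpha=max(weights), t=0.05))
```

Shrinking either `v` or `alpha` tightens the tail, which is why the proof regulates the magnitude of the terms before applying the inequality.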

Let $V_{U''}$ be the variance proxy term $V$ in Bernstein’s inequality,
as defined earlier, attached to $U''$. The functions $f, g :
(0,1) \times \mathbb{N} \to (0,1)$ with $f(x, n) = x(1 - x)^n \big(1 - (1 - x)^n\big)$


Solving for the first derivative using (60), we obtain

$$
\gamma_\epsilon = -2 W_{-1}\Big(-\frac{\epsilon}{2\sqrt{e}}\Big). \tag{62}
$$

Inspecting the second derivative given by (61), we can see
that the function $c(\gamma, \epsilon)$ is concave with respect to $\gamma$ for
any $\gamma > 2$. Recall, moreover, that there are interrelated
restrictions on $\gamma$, $\epsilon$ and $n$ in derivation of (51) and (52)
which are collectively expressed as

max{e-e, 1, 2 , 7 ( 1 )} < 7 < e n , n > \j e ] -1. (63) 
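The optimizer in (62) and the resulting rate $c(\epsilon)$ can be computed without a Lambert W routine. The sketch below assumes the stationarity condition $\gamma - 1 = 2 \ln(\gamma / \epsilon)$ obtained by setting (60) to zero, and takes its root with $\gamma > 2$, which corresponds to the $W_{-1}$ branch; the function names are hypothetical.

```python
import math


def gamma_eps(eps: float, tol: float = 1e-12) -> float:
    """Root with gamma > 2 of gamma - 1 = 2 * ln(gamma / eps), by bisection."""
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie in (0, 1)")

    def h(gamma: float) -> float:
        return gamma - 1.0 - 2.0 * math.log(gamma / eps)

    # h(2) < 0 for every eps in (0, 1) and h is increasing on (2, inf).
    lo, hi = 2.0, 4.0
    while h(hi) < 0.0:
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if h(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def rate(gamma: float, eps: float) -> float:
    """c(gamma, eps) = 3 (gamma - 1)^2 / (10 gamma^2 ln(gamma / eps))."""
    return 3.0 * (gamma - 1.0) ** 2 / (10.0 * gamma ** 2 * math.log(gamma / eps))


eps = 0.01
g = gamma_eps(eps)
print(g, rate(g, eps))
```

Evaluating `rate` slightly to the left and right of `gamma_eps(eps)` gives smaller values, in line with the concavity argument above.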

5.2 Proof of Theorem 2: Lower Deviation Bound 

The proof for the lower deviation bound proceeds in the same
spirit as Section 5.1. The idea is again to reduce the problem
to one in which all bins that are larger than the threshold $\tau$
are eliminated.

We split large bins and then absorb small bins so that
we can shrink the variance while controlling the magnitude of
terms (and consequently the key quantities $\alpha$ and $V$) before
applying Bernstein’s inequality.

By choosing $\theta$ such that $f(\theta) = \epsilon / \gamma$ so that $\theta = \ln(\gamma / \epsilon)$, for
any $0 < \epsilon < 1$ with $e\epsilon < \gamma < e^n \epsilon$ being the generic domain
for $\gamma$, we obtain a lower deviation bound as follows


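\begin{align}
P(Y - E[Y] < -\epsilon) &\le \tag{64}\\
P(Y' - E[Y] < -\epsilon) &= \tag{65}\\
P\big(Y' - E[Y'] + (E[Y'] - E[Y]) < -\epsilon\big) &\le \tag{66}\\
P\big(Y' - E[Y'] - f(\theta) < -\epsilon\big) &= \tag{67}\\
P\Big(Y' - E[Y'] < -\Big(\frac{\gamma - 1}{\gamma}\Big)\epsilon\Big) &\le \tag{68}\\
\exp\left(-\frac{\big(\frac{\gamma - 1}{\gamma}\big)^2 \epsilon^2}{2\big(V_{C''} + \frac{\alpha}{3} \cdot \big(\frac{\gamma - 1}{\gamma}\big) \cdot \epsilon\big)}\right) &\le \tag{69}\\
\exp\left(-\frac{\big(\frac{\gamma - 1}{\gamma}\big)^2 \epsilon^2}{2\big(\frac{\theta}{n} \cdot \epsilon + \frac{\tau'}{3} \cdot \big(\frac{\gamma - 1}{\gamma}\big) \cdot \epsilon\big)}\right) &\le \tag{70}\\
\inf_{\gamma} \left\{ \exp\left(-\frac{3 n \epsilon (\gamma - 1)^2}{10 \gamma^2 \ln(\gamma / \epsilon)}\right) \right\} &= \tag{71}\\
e^{-c(\epsilon) \cdot n \epsilon}, \tag{72}
\end{align}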


Bernstein’s inequality. Along the way, we introduced a
collection of concepts and tools in the intersection of probability
theory and information theory that have the potential
to be advantageous in more general settings.


Recall that Bernstein’s inequality hinges on establishing an
upper bound on $Z(X, \lambda)$ given by (29) in a particular way.
Clearly, this choice is not unique and one can choose any
other upper bound (e.g., cf. Lugosi (2003)) and apply the
same technique to derive potentially tight bounds achievable
within the framework of the exponential moment method.


Our bounds sharpen the leading results for missing mass
in the case of small deviations. These inequalities hold
subject to the mild condition that the sample size is large
enough, namely $n \ge \lceil \gamma_\epsilon \rceil - 1$.

We select the best known bounds in Berend and
Kontorovich (2013) for the comparison. Our lower deviation
and upper deviation bounds improve on the state of the art
for any $0 < \epsilon < 0.021$ and any $0 < \epsilon < 0.045$ respectively.
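The improvement ranges can be explored numerically by comparing exponents. The following is a rough sketch under two assumptions that are not stated in this section: the baseline bounds are taken to have the quadratic form $e^{-\kappa n \epsilon^2}$, and the constants $\kappa = 1$ (upper deviation) and $\kappa = 1.92$ (lower deviation) used in the example are recalled from Berend and Kontorovich (2013) and should be checked against that paper. All function names are hypothetical.

```python
import math


def gamma_eps(eps: float, tol: float = 1e-12) -> float:
    """Root with gamma > 2 of gamma - 1 = 2 * ln(gamma / eps), by bisection."""
    def h(gamma: float) -> float:
        return gamma - 1.0 - 2.0 * math.log(gamma / eps)

    lo, hi = 2.0, 4.0
    while h(hi) < 0.0:
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if h(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def c_rate(eps: float) -> float:
    """c(eps) = c(gamma_eps, eps), the rate in the bound exp(-c(eps) * n * eps)."""
    g = gamma_eps(eps)
    return 3.0 * (g - 1.0) ** 2 / (10.0 * g * g * math.log(g / eps))


def improves_on_quadratic(eps: float, kappa: float) -> bool:
    """True when exp(-c(eps) n eps) < exp(-kappa n eps^2), i.e. c(eps) > kappa * eps."""
    return c_rate(eps) > kappa * eps


# kappa values are assumptions, see the lead-in paragraph.
for eps in (0.01, 0.02, 0.04, 0.05):
    print(eps, improves_on_quadratic(eps, 1.0), improves_on_quadratic(eps, 1.92))
```

Since both exponents are linear in $n$, the comparison reduces to $c(\epsilon)$ against $\kappa \epsilon$ and does not depend on the sample size, provided $n$ satisfies the domain restriction.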


where $c(\epsilon)$ and $\gamma_\epsilon$ are as before and domain restrictions are
determined similar to (63).


The variables $Y'$ and $Y''$, and the sets $C'$ and $C''$ are defined
in the same fashion as Section 5.1.

The first inequality is proved in the same way as (45). Now,
we set $E[Y_i'] = q_i'$ such that


$$
q_i' = \begin{cases} q_i & \text{if } w_i \le \tau', \\ 0 & \text{otherwise.} \end{cases} \tag{73}
$$


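Inequality (67) follows because the compensation gap will
remain small since we have

\begin{align}
g_l(\theta) = E[Y'] - E[Y] &= \sum_{i \in \mathcal{I}} w_i (q_i' - q_i) \notag\\
&= \sum_{i:\, w_i \le \tau'} w_i q_i - \sum_{i \in \mathcal{I}} w_i q_i = -\sum_{i:\, w_i > \tau'} w_i q_i \notag\\
&\ge -\sum_{i:\, w_i > \tau'} w_i f(\theta) \ge -f(\theta). \tag{74}
\end{align}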


Plugging in the definitions, we can see that the
compensation gap can be expressed as a function of $\epsilon$ and
show that the following holds

$$
|g(\epsilon)| \le \sqrt{e} \cdot \exp\Big(W_{-1}\Big(-\frac{\epsilon}{2\sqrt{e}}\Big)\Big), \tag{75}
$$

where we have dropped the subscript of gap $g$. Note that
the gap is negligible for small $\epsilon$ compared to large values
of $\epsilon$ for both (52) and (72). This observation supports the
fact that we obtained sharper bounds for small deviations.
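The size of the compensation gap can be illustrated with a short computation. This is a sketch assuming the choice $f(\theta) = \epsilon / \gamma$ with $\gamma = \gamma_\epsilon$, so that the gap bound is $\epsilon / \gamma_\epsilon$; it also checks numerically that this value agrees with the Lambert W form in (75), using the fact that $-\gamma_\epsilon / 2$ solves $W e^{W} = -\epsilon / (2\sqrt{e})$. The function names are hypothetical.

```python
import math


def gamma_eps(eps: float, tol: float = 1e-12) -> float:
    """Root with gamma > 2 of gamma - 1 = 2 * ln(gamma / eps), by bisection."""
    def h(gamma: float) -> float:
        return gamma - 1.0 - 2.0 * math.log(gamma / eps)

    lo, hi = 2.0, 4.0
    while h(hi) < 0.0:
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if h(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def gap_bound(eps: float) -> float:
    """Gap bound eps / gamma_eps."""
    return eps / gamma_eps(eps)


def gap_bound_lambert(eps: float) -> float:
    """Same bound written as sqrt(e) * exp(W), with W = -gamma_eps / 2."""
    w = -gamma_eps(eps) / 2.0
    return math.sqrt(math.e) * math.exp(w)


for eps in (0.01, 0.1, 0.5):
    # Third column is the gap relative to the deviation size eps.
    print(eps, gap_bound(eps), gap_bound(eps) / eps)
```

The relative gap equals $1 / \gamma_\epsilon$, which shrinks as $\epsilon$ decreases because $\gamma_\epsilon$ grows.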

Mathematical analysis of missing mass via concentration
inequalities has various important applications, including
density estimation, generalization bounds and handling
missing data, to name a few. Any
refinement in bounds or tools developed for the former may
directly contribute to advancement in those applications.